home *** CD-ROM | disk | FTP | other *** search
-
- RFC 789
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Vulnerabilities of Network Control Protocols: An Example
-
-
-
- Eric C. Rosen
-
-
- Bolt Beranek and Newman Inc.
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- This paper has appeared in the January 1981 edition of the
-
- SIGSOFT Software Engineering Notes, and will soon appear in the
-
- SIGCOMM Computer Communications Review. It is being circulated
-
- as an RFC because it is thought that it may be of interest to a
-
- wider audience, particularly to the internet community. It is a
-
- case study of a particular kind of problem that can arise in
-
- large distributed systems, and of the approach used in the
-
- ARPANET to deal with one such problem.
-
-
- On October 27, 1980, there was an unusual occurrence on the
-
- ARPANET. For a period of several hours, the network appeared to
-
- be unusable, due to what was later diagnosed as a high priority
-
- software process running out of control. Network-wide
-
- disturbances are extremely unusual in the ARPANET (none has
-
- occurred in several years), and as a result, many people have
-
- expressed interest in learning more about the etiology of this
-
- particular incident. The purpose of this note is to explain what
-
- the symptoms of the problem were, what the underlying causes
-
- were, and what lessons can be drawn. As we shall see, the
-
- immediate cause of the problem was a rather freakish hardware
-
- malfunction (which is not likely to recur) which caused a faulty
-
- sequence of network control packets to be generated. This faulty
-
- sequence of control packets in turn affected the apportionment of
-
- software resources in the IMPs, causing one of the IMP processes
-
- to use an excessive amount of resources, to the detriment of
-
- other IMP processes. Restoring the network to operational
-
-
- - 1 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- condition was a relatively straightforward task. There was no
-
- damage other than the outage itself, and no residual problems
-
- once the network was restored. Nevertheless, it is quite
-
- interesting to see the way in which unusual (indeed, unique)
-
- circumstances can bring out vulnerabilities in network control
-
- protocols, and that shall be the focus of this paper.
-
-
- The problem began suddenly when we discovered that, with
-
- very few exceptions, no IMP was able to communicate reliably with
-
- any other IMP. Attempts to go from a TIP to a host on some other
-
- IMP only brought forth the "net trouble" error message,
-
- indicating that no physical path existed between the pair of
-
- IMPs. Connections which already existed were summarily broken.
-
- A flood of phone calls to the Network Control Center (NCC) from
-
- all around the country indicated that the problem was not
-
- localized, but rather seemed to be affecting virtually every IMP.
-
-
- As a first step towards trying to find out what the state of
-
- the network actually was, we dialed up a number of TIPs around
-
- the country. What we generally found was that the TIPs were up,
-
- but that their lines were down. That is, the TIPs were
-
- communicating properly with the user over the dial-up line, but
-
- no connections to other IMPs were possible.
-
-
- We tried manually restarting a number of IMPs which are in
-
- our own building (after taking dumps, of course). This procedure
-
- initializes all of the IMPs' dynamic data structures, and will
-
-
- - 2 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- often clear up problems which arise when, as sometimes happens in
-
- most complex software systems, the IMPs' software gets into a
-
- "funny" state. The IMPs which were restarted worked well until
-
- they were connected to the rest of the net, after which they
-
- exhibited the same complex of symptoms as the IMPs which had not
-
- been restarted.
-
-
- From the facts so far presented, we were able to draw a
-
- number of conclusions. Any problem which affects all IMPs
-
- throughout the network is usually a routing problem. Restarting
-
- an IMP re-initializes the routing data structures, so the fact
-
- that restarting an IMP did not alleviate the problem in that IMP
-
- suggested that the problem was due to one or more "bad" routing
-
- updates circulating in the network. IMPs which were restarted
-
- would just receive the bad updates from those of their neighbors
-
- which were not restarted. The fact that IMPs seemed unable to
-
- keep their lines up was also a significant clue as to the nature
-
- of the problem. Each pair of neighboring IMPs runs a line
-
- up/down protocol to determine whether the line connecting them is
-
- of sufficient quality to be put into operation. This protocol
-
- involves the sending of HELLO and I-HEARD-YOU messages. We have
-
- noted in the past that under conditions of extremely heavy CPU
-
- utilization, so many buffers can pile up waiting to be served by
-
- the bottleneck CPU process, that the IMPs are unable to acquire
-
- the buffers needed for receiving the HELLO or I-HEARD-YOU
-
- messages. If a condition like this lasts for any length of time,
-
-
- - 3 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- the IMPs may not be able to run the line up/down protocol, and
-
- lines will be declared down by the IMPs' software. On the basis
-
- of all these facts, our tentative conclusion was that some
-
- malformed update was causing the routing process in the IMPs to
-
- use an excessive amount of CPU time, possibly even to be running
-
- in an infinite loop. (This would be quite a surprise though,
-
- since we tried very hard to protect ourselves against malformed
-
- updates when we designed the routing process.) As we shall see,
-
- this tentative conclusion, although on the right track, was not
-
- quite correct, and the actual situation turned out to be much
-
- more complex.
-
-
- When we examined core dumps from several IMPs, we noted that
-
- most, in some cases all, of the IMPs' buffers contained routing
-
- updates waiting to be processed. Before describing this
-
- situation further, it is necessary to explain some of the details
-
- of the routing algorithm's updating scheme. (The following
-
- explanation will of course be very brief and incomplete. Readers
-
- with a greater level of interest are urged to consult the
-
- references.) Every so often, each IMP generates a routing update
-
- indicating which other IMPs are its immediate neighbors over
-
- operational lines, and the average per-packet delay (in
-
- milliseconds) over that line. Every IMP is required to generate
-
- such an update at least once per minute, and no IMP is permitted
-
- to generate more than a dozen such updates over the course of a
-
- minute. Each update has a 6-bit sequence number which is
-
-
- - 4 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- advanced by 1 (modulo 64) for each successive update generated by
-
- a particular IMP. If two updates generated by the same IMP have
-
- sequence numbers n and m, update n is considered to be LATER
-
- (i.e., more recently generated) than update m if and only if one
-
- of the following two conditions hold:
-
-
-
- (a) n > m, and n - m <= 32
-
- (b) n < m, and m - n > 32
-
-
- (where the comparisons and subtractions treat n and m as unsigned
-
- 6-bit numbers, with no modulus). When an IMP generates an
-
- update, it sends a copy of the update to each neighbor. When an
-
- IMP A receives an update u1 which was generated by a different
-
- IMP B, it first compares the sequence number of u1 with the
-
- sequence number of the last update, u2, that it accepted from B.
-
- If this comparison indicates that u2 is LATER than u1, u1 is
-
- simply discarded. If, on the other hand, u1 appears to be the
-
- LATER update, IMP A will send u1 to all its neighbors (including
-
- the one from which it was received). The sequence number of u1
-
- will be retained in A's tables as the LATEST received update from
-
- B. Of course, u1 is always accepted if A has seen no previous
-
- update from B. Note that this procedure is designed to ensure
-
- that an update generated by a particular IMP is received,
-
- unchanged, by all other IMPs in the network, IN THE PROPER
-
- SEQUENCE. Each routing update is broadcast (or flooded) to all
-
- IMPs, not just to immediate neighbors of the IMP which generated
-
-
- - 5 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- the update (as in some other routing algorithms). The purpose of
-
- the sequence numbers is to ensure that all IMPs will agree as to
-
- which update from a given IMP is the most recently generated
-
- update from that IMP.
-
-
- For reliability, there is a protocol for retransmitting
-
- updates over individual links. Let X and Y be neighboring IMPs,
-
- and let A be a third IMP. Suppose X receives an update which was
-
- generated by A, and transmits it to Y. Now if in the next 100 ms
-
- or so, X does not receive from Y an update which originated at A
-
- and whose sequence number is at least as recent as that of the
-
- update X sent to Y, X concludes that its transmission of the
-
- update did not get through to Y, and that a retransmission is
-
- required. (This conclusion is warranted, since an update which
-
- is received and adjudged to be the most recent from its
-
- originating IMP is sent to all neighbors, including the one from
-
- which it was received.) The IMPs do not keep the original update
-
- packets buffered pending retransmission. Rather, all the
-
- information in the update packet is kept in tables, and the
-
- packet is re-created from the tables if necessary for a
-
- retransmission.
-
-
- This transmission protocol ("flooding") distributes the
-
- routing updates in a very rapid and reliable manner. Once
-
- generated by an IMP, an update will almost always reach all other
-
- IMPs in a time period on the order of 100 ms. Since an IMP can
-
- generate no more than a dozen updates per minute, and there are
-
- - 6 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- 64 possible sequence numbers, sequence number wrap-around is not
-
- a problem. There is only one exception to this. Suppose two
-
- IMPs A and B are out of communication for a period of time
-
- because there is no physical path between them. (This may be due
-
- either to a network partition, or to a more mundane occurrence,
-
- such as one of the IMPs being down.) When communication is
-
- re-established, A and B have no way of knowing how long they have
-
- been out of communication, or how many times the other's sequence
-
- numbers may have wrapped around. Comparing the sequence number
-
- of a newly received update with the sequence number of an update
-
- received before the outage may give an incorrect result. To deal
-
- with this problem, the following scheme is adopted. Let t0 be
-
- the time at which IMP A receives update number n generated by IMP
-
- B. Let t1 be t0 plus 1 minute. If by t1, A receives no update
-
- generated by B with a LATER sequence number than n, A will accept
-
- any update from B as being more recent than n. So if two IMPs
-
- are out of communication for a period of time which is long
-
- enough for the sequence numbers to have wrapped around, this
-
- procedure ensures that proper resynchronization of sequence
-
- numbers is effected when communication is re-established.
-
-
- There is just one more facet of the updating process which
-
- needs to be discussed. Because of the way the line up/down
-
- protocol works, a line cannot be brought up until 60 seconds
-
- after its performance becomes good enough to warrant operational
-
- use. (Roughly speaking, this is the time it takes to determine
-
-
- - 7 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- that the line's performance is good enough.) During this
-
- 60-second period, no data is sent over the line, but routing
-
- updates are transmitted. Remember that every node is required to
-
- generate a routing update at least once per minute. Therefore,
-
- this procedure ensures that if two IMPs are out of communication
-
- because of the failure of some line, each has the most recent
-
- update from the other by the time communication is
-
- re-established.
-
-
- This very short introduction to the routing algorithm's
-
- updating protocol should provide enough background to enable the
-
- reader to understand the particular problem under discussion;
-
- further justification and detail can be found in the references.
-
-
- Let us return now to the discussion of the network outage.
-
- I have already mentioned that the core dumps showed almost all
-
- buffers holding routing updates which were waiting to be
-
- processed. Close inspection showed that all the updates were
-
- from a single IMP, IMP 50. By a strange "coincidence," IMP 50
-
- had been malfunctioning just before the network-wide outage
-
- occurred, and was off the net during the period of the outage.
-
- Hence it was not generating any updates during the period of the
-
- outage. In addition, IMP 29, an immediate neighbor of IMP 50,
-
- was also suffering hardware malfunctions (in particular, dropping
-
- bits), but was up (though somewhat flakey) while the network was
-
- in bad shape. Furthermore, the malfunction in IMP 50 had to do
-
- with its ability to communicate properly with the neighboring IMP
-
- - 8 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- 29. Although we did not yet understand how it was possible for
-
- so many updates from one IMP to be extant simultaneously, we did
-
- understand enough to be able to get the network to recover. All
-
- that was necessary was to patch the IMPs to disregard any updates
-
- from IMP 50, which after all was down anyway. When the network
-
- is operating normally, broadcasting a patch to all IMPs can be
-
- done in a matter of minutes. With the network operating as it
-
- was during the period of the outage, this can take as much as 3
-
- or 4 hours. (Remember that the IMPs are generally unmanned, and
-
- that the only way of controlling them from the NCC is via the
-
- network itself. This is perfectly satisfactory when an outage
-
- affects only a small group of IMPs, but is an obvious problem
-
- when the outage has network-wide effects.) This procedure was
-
- fully successful in bringing the network back up.
-
-
- When we looked closely at the dumps, we saw that not only
-
- were all the updates on the queue from IMP 50, but they all had
-
- one of three sequence numbers (either 8, 40, or 44), and were
-
- ordered in the queue as follows:
-
- 8, 40, 44, 8, 40, 44, 8, 40, 44, ... Note that by the definition
-
- of LATER, 44 is LATER than 40 (44 > 40 and 44 - 40 <= 32), 40 is
-
- LATER than 8 (40 > 8 and 40 - 8 <= 32), and 8 is LATER than 44
-
- (8 < 44 and 44 - 8 > 32). Given the presence of three updates
-
- from the same IMP with these three sequence numbers, this is what
-
- would be expected. Since each update is LATER than one of the
-
- others, a cycle is formed which keeps the three updates floating
-
-
- - 9 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- around the network indefinitely. Thus the IMPs spend most of
-
- their CPU time and buffer space in processing these updates. The
-
- problem was to figure out how these three updates could possibly
-
- have existed at the same time. After all, getting from update 8
-
- to update 40 should require 2 or 3 full minutes, plus 31
-
- intervening sequence numbers. So how could 8 still be around
-
- when 40 was generated, especially since no updates with
-
- intervening sequence numbers were present?
-
-
- Our first thought was that maybe the real-time clock in IMP
-
- 50 was running one or two orders of magnitude faster than normal,
-
- invalidating our assumptions about the maximum number of updates
-
- which could be generated in a given time. An alternative
-
- hypothesis suggested itself however when we looked at the binary
-
- representations of the three sequence numbers:
-
-
- 8 - 001000
-
- 40 - 101000
-
- 44 - 101100
-
-
- Note that 44 has only one more bit than 40, which has only one
-
- more bit than 8. Furthermore, the three different updates were
-
- completely identical, except for their sequence numbers. This
-
- suggests that there was really only one update, 44, whose
-
- sequence number was twice corrupted by dropped bits. (Of course,
-
- it's also possible that the "real" update was 8, and was
-
- corrupted by added bits. However, bit-dropping has proven itself
-
-
- - 10 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- to be a much more common sort of hardware malfunction than
-
- bit-adding, although spontaneously dropped bits may sometimes
-
- come back on spontaneously.)
-
-
- Surely, the reader will object, there must be protection
-
- against dropped bits. Yes there is protection, but apparently
-
- not enough. The update packets themselves are checksummed, so a
-
- dropped bit in an update packet is readily detected. Remember
-
- though that if an update needs to be retransmitted, it is
-
- recreated from tabled information. For maximal reliability, the
-
- tables must be checksummed also, and the checksum must be
-
- recomputed every time the table is accessed. However, this would
-
- require either a large number of CPU cycles (for frequent
-
- checksumming of a large area of memory) or a large amount of
-
- memory (to store the checksums for a lot of small areas). Since
-
- CPU cycles and memory are both potentially scarce resources, this
-
- did not seem to us to be a cost-effective way to deal with
-
- problems that arise, say, once per year (this is the first such
-
- problem encountered in a year and a half of running this routing
-
- algorithm). Time and space can be saved by recomputing the
-
- checksum at a somewhat slower frequency, but this is less
-
- reliable, in that it allows a certain number of dropped bits to
-
- "fall between the cracks." It seems likely then that one of the
-
- malfunctioning IMPs had to retransmit update 44 at least twice,
-
- (recreating it each time from tabled information), retransmitting
-
- it at least once with the corrupted sequence number 40, and at
-
-
- - 11 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- least once with the corrupted sequence number 8. This would
-
- cause those three sequence numbers to be extant in the network
-
- simultaneously, even though protocol is supposed to ensure that
-
- this is impossible.
-
-
- Actually, the detection of dropped bits is most properly a
-
- hardware function. The next generation of IMP hardware (the "C30
-
- IMP") will be able to detect and correct all single-bit errors,
-
- and will detect all other bit errors. Uncorrectable bit errors
-
- will cause the IMP to go into its "loader/dumper." (An IMP in
-
- its loader/dumper is not usable for transferring data, and is
-
- officially in the "down" state. However, an IMP in its
-
- loader/dumper is easily controllable from the NCC, and can be
-
- restarted or reloaded without on-site intervention.) Current
-
- hardware does have parity checking (which should detect single
-
- dropped bits), but this feature has had to be turned off since
-
- (a) there are too many spurious parity "errors," i.e., most of
-
- the time when the machines complain of parity errors there don't
-
- really seem to be any, and (b) parity errors cause the machines
-
- to simply halt, rather than go into their loader/dumpers, which
-
- means that on-site intervention is required to restart them.
-
-
- Pending the introduction of improved hardware, what can be
-
- done to prevent problems like this from recurring in the future?
-
- It is easy to think of many ways of avoiding this particular
-
- problem, especially if one does not consider the problems that
-
- may arise from the "fixes." For example, we might be able to
-
- - 12 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- avoid this sort of problem by spending a lot more CPU cycles on
-
- checksumming, but this may be too expensive because of the side
-
- effects it would introduce. (Also, it is not clear that any
-
- memory checksumming strategy can be totally free of "cracks.") A
-
- very simple and conservative fix to prevent this particular
-
- problem from recurring is to modify clause (a) of the definition
-
- of LATER so that the "<=" is replaced by "<" (strictly less
-
- than). We will implement this fix, but it cannot be guaranteed
-
- that no related problems will ever arise.
-
-
- What is really needed is not some particular fix to the
-
- routing algorithm, but a more general fix. In some sense, the
-
- problem we saw was not really a routing problem. The routing
-
- code was working correctly, and the routes that were generated
-
- were correct and consistent. The real problem is that a freakish
-
- hardware malfunction caused a high priority process to run wild,
-
- devouring resources needed by other processes, thereby making the
-
- network unusable. The fact that the wild process was the routing
-
- process is incidental. In designing the routing process, we
-
- carefully considered the amount of resource utilization it would
-
- require. By strictly controlling and limiting the rate at which
-
- updates can be generated, we tried to prevent any situation in
-
- which the routing process would make excessive demands on the
-
- system. As we have seen though, even our carefully designed
-
- mechanisms were unable to protect against every possible sort of
-
- hardware failure. We need a better means of detecting that some
-
-
- - 13 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- high priority process in the IMP, despite all the safeguards we
-
- have built in, is still consuming too many resources. Once this
-
- is detected, the IMP can be automatically placed in its
-
- loader/dumper. In the case under discussion, we would have liked
-
- to have all the IMPs go into their loader/dumpers when the
-
- problem arose. This would have enabled us to re-initialize and
-
- restart all the IMPs much more quickly. (Although restarting
-
- individual IMPs did little good, restarting all the IMPs
-
- simultaneously would have cleared up the problem instantly, since
-
- all routing tables in all IMPs would have been initialized
-
- simultaneously.) It took us no more than an hour to figure out
-
- how to restore the network; several additional hours were
-
- required because it took so long for us to gain control of the
-
- misbehaving IMPs and get them back to normal. A built-in
-
- software alarm system (assuming, of course, that it was not
-
- subject to false alarms) might have enabled us to restore the
-
- network more quickly, significantly reducing the duration of the
-
- outage. This is not to say that a better alarm and control
-
- system could ever be a replacement for careful study and design
-
- which attempts to properly distribute the utilization of
-
- important resources, but only that it is a necessary adjunct, to
-
- handle the cases that will inevitably fall between the cracks of
-
- even the most careful design.
-
-
-
-
-
-
-
- - 14 -
-
- RFC 789 Bolt Beranek and Newman Inc.
- Eric C. Rosen
-
- REFERENCES
-
-
-
- "The New Routing Algorithm for the ARPANET," IEEE TRANSACTIONS ON
-
- COMMUNICATIONS, May 1980, J.M. McQuillan, I. Richer, E.C. Rosen.
-
-
- "The Updating Protocol of ARPANET's New Routing Algorithm,"
-
- COMPUTER NETWORKS, February 1980, E.C. Rosen.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- - 15 -
-
-